Markov Decision Processes
Markov Process
A Markov decision process formally describes an environment for reinforcement learning in which the environment is fully observable. The environment's response to an agent's action is probabilistic and is described by a set of transition probabilities. The agent and the environment interact at each of a sequence of discrete time steps: the agent selects an action, and the environment responds by presenting the agent with a reward and the next state. The environment's response at time t depends only on the state and action at time t. The process is described by a 5-tuple, (S, A, P, R, γ), where:
- S is a finite set of states,
- A is a finite set of actions,
- P is a state transition probability matrix,
- R is a reward function, R(s, a, s'),
- γ is a discount factor, γ ∈ [0, 1].
"The future is independent of the past given the present." This is the Markov property.
Definition: A state $S_t$ is Markov if and only if
$$\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \dots, S_t]$$
The state captures all the relevant information from the history. Once the state is known, the history may be thrown away. The state is a sufficient statistic of the future.
State Transition Probability Matrix
For a Markov state $s$ and successor state $s'$, the state transition probability is defined by
$$\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$$
The state transition probability matrix $\mathcal{P}$ is a square matrix of size $|S| \times |S|$, where $|S|$ is the number of states. The element $\mathcal{P}_{ss'}$ is the probability of transitioning from state $s$ to state $s'$.
Each row of the probability matrix sums to 1: from any single state, the probabilities of transitioning to all possible successor states must total 1.
A Markov process is a memoryless random process: the next state depends only on the current state, not on the sequence of events that preceded it.
Definition: A Markov process (or Markov chain) is a tuple $\langle S, \mathcal{P} \rangle$, where $S$ is a finite set of states and $\mathcal{P}$ is a state transition probability matrix.
Example: Consider a simple weather model with 4 states: sunny, cloudy, rainy, and foggy. The state transition probability matrix is then a $4 \times 4$ matrix in which each row sums to 1.
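As a minimal sketch, such a chain can be simulated directly. The transition probabilities below are made-up illustrative values (they are not specified in these notes); any row-stochastic matrix would do.

```python
import numpy as np

# Hypothetical transition probabilities (illustrative values only).
# Rows: current state, columns: next state; each row sums to 1.
states = ["sunny", "cloudy", "rainy", "foggy"]
P = np.array([
    [0.70, 0.20, 0.05, 0.05],  # from sunny
    [0.30, 0.40, 0.20, 0.10],  # from cloudy
    [0.20, 0.30, 0.40, 0.10],  # from rainy
    [0.20, 0.30, 0.10, 0.40],  # from foggy
])
assert np.allclose(P.sum(axis=1), 1.0)  # each row is a probability distribution

# Sample a 10-step trajectory: the next state depends only on the current one.
rng = np.random.default_rng(0)
s = 0  # start in "sunny"
trajectory = [states[s]]
for _ in range(10):
    s = rng.choice(len(states), p=P[s])
    trajectory.append(states[s])
print(" -> ".join(trajectory))
```

The `assert` makes the row-sum property from the text explicit: each row of $\mathcal{P}$ must be a valid probability distribution.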
Markov Reward Process
A Markov reward process is a Markov chain with values. It is a tuple $\langle S, \mathcal{P}, \mathcal{R}, \gamma \rangle$, where:
- S is a finite set of states,
- P is a state transition probability matrix,
- R is a reward function,
- γ is a discount factor, γ ∈ [0, 1].
The reward function R(s, s') defines the immediate reward received after transitioning from state s to state s'. The discount factor γ determines the present value of future rewards.
Definition: A Markov reward process is a tuple $\langle S, \mathcal{P}, \mathcal{R}, \gamma \rangle$, where $S$ is a finite set of states, $\mathcal{P}$ is a state transition probability matrix, $\mathcal{R}$ is a reward function, and $\gamma \in [0, 1]$ is a discount factor.
Definition: The return $G_t$ is the total discounted reward from time-step $t$:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
The discount factor $\gamma \in [0, 1]$ determines the present value of future rewards: a reward received $k$ time-steps in the future is worth $\gamma^k$ times what it would be worth if received immediately. Discounting keeps the sum of rewards finite.
- γ = 0: the agent is short-sighted ("myopic") and considers only immediate rewards.
- γ = 1: the agent is far-sighted and weights future rewards equally.
Why do we use a discount factor?
- It keeps the return finite and mathematically well-defined.
- It guarantees a stable solution and avoids infinite rewards in cyclic environments.
- It reduces reliance on future rewards, which are uncertain.
- An immediate reward is more valuable than a delayed one.
- It mirrors how animals behave in nature (preferring immediate rewards).
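The effect of γ on the return can be seen numerically. The reward sequence below is an arbitrary illustrative example:

```python
def discounted_return(rewards, gamma):
    """Compute G_t = sum_k gamma^k * R_{t+k+1} for a finite reward sequence.

    Iterating backwards uses the recursion G_t = R_{t+1} + gamma * G_{t+1}.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 1.0, 1.0, 10.0]  # example reward sequence (assumed values)
print(discounted_return(rewards, 0.0))  # myopic: only the first reward counts -> 1.0
print(discounted_return(rewards, 1.0))  # all rewards weighted equally -> 13.0
print(discounted_return(rewards, 0.9))  # intermediate: later rewards shrink by gamma^k
```

With γ = 0 the large delayed reward of 10 is ignored entirely; with γ = 1 it counts at full value.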
Value Function
Definition: The state value function $v(s)$ of a Markov reward process is the expected return starting from state $s$:
$$v(s) = \mathbb{E}[G_t \mid S_t = s]$$
The value function estimates how good it is for the agent to be in a particular state; it is a prediction of future rewards.
Bellman Equation
The Bellman equation is a fundamental equation in dynamic programming. It decomposes the value function into two parts: immediate reward and discounted value of the next state.
Definition: The Bellman equation for the state value function $v(s)$ is given by:
$$v(s) = \mathbb{E}[R_{t+1} + \gamma v(S_{t+1}) \mid S_t = s] = \sum_{s' \in S} \mathcal{P}_{ss'} \big[ R(s, s') + \gamma v(s') \big]$$
The Bellman equation expresses the relationship between the value of a state and the values of its successor states. It is a recursive equation that decomposes the value function into two parts: immediate reward and discounted value of the next state.
Matrix Form:
$$v = \mathcal{R} + \gamma \mathcal{P} v$$
The Bellman equation is a linear equation and can be solved directly for the value function: $v = (I - \gamma \mathcal{P})^{-1} \mathcal{R}$. The direct solution costs $O(n^3)$ for $n$ states, so it is practical only for small MRPs.
For larger problems there are iterative methods for solving the Bellman equation, such as dynamic programming, Monte Carlo evaluation, and temporal-difference learning.
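For a small MRP the matrix form can be solved directly. The two-state chain below uses assumed (illustrative) dynamics and rewards:

```python
import numpy as np

# A tiny two-state MRP with illustrative dynamics and rewards.
P = np.array([[0.5, 0.5],
              [0.2, 0.8]])      # state transition probability matrix
R = np.array([1.0, -1.0])       # expected immediate reward per state
gamma = 0.9

# Direct solution of v = R + gamma * P v  <=>  (I - gamma * P) v = R
v = np.linalg.solve(np.eye(2) - gamma * P, R)
print(v)

# Sanity check: the solution is a fixed point of the Bellman backup.
assert np.allclose(R + gamma * P @ v, v)
```

For γ < 1 the matrix $(I - \gamma \mathcal{P})$ is always invertible, so this direct solution exists; it is the $O(n^3)$ step that the iterative methods above avoid.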
Markov Decision Process
A Markov Decision Process (MDP) is a Markov reward process with decisions. It is a tuple $\langle S, A, \mathcal{P}, \mathcal{R}, \gamma \rangle$, where:
- S is a finite set of states,
- A is a finite set of actions,
- P is a state transition probability matrix, $\mathcal{P}^a_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$,
- R is a reward function,
- γ is a discount factor, γ ∈ [0, 1].
The agent interacts with the environment by taking actions. The agent selects an action, and the environment responds by presenting the agent with a reward and the next state. The environment's response at time t depends on the state and action at time t.
Definition: A policy $\pi$ is a distribution over actions given states:
$$\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$$
The policy defines the agent's behavior. It is a distribution over actions given states. The policy can be deterministic or stochastic.
MDP policies depend on the current state, not the history. The agent-environment interaction is a Markov process.
Given a Markov Decision Process $M = \langle S, A, \mathcal{P}, \mathcal{R}, \gamma \rangle$ and a policy $\pi$:
- The state sequence $S_1, S_2, \dots$ is a Markov process $\langle S, \mathcal{P}^\pi \rangle$.
- The state and reward sequence $S_1, R_2, S_2, \dots$ is a Markov reward process $\langle S, \mathcal{P}^\pi, \mathcal{R}^\pi, \gamma \rangle$.
Value Function with Policy
Definition: The state-value function $v_\pi(s)$ of a Markov Decision Process (MDP) is the expected return starting from state $s$ and following policy $\pi$:
$$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$$
Definition: The action-value function $q_\pi(s, a)$ of a Markov Decision Process (MDP) is the expected return starting from state $s$, taking action $a$, and thereafter following policy $\pi$:
$$q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$$
The value function estimates how good it is to be in a particular state, or to take a particular action in that state, under policy $\pi$.
Bellman Equation with Policy
The Bellman equation with a policy is a fundamental equation in dynamic programming. It decomposes the value function into two parts: the immediate reward and the discounted value of the successor state under policy $\pi$.
- State-Value Function:
$$v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s] = \sum_{a \in A} \pi(a \mid s)\, q_\pi(s, a)$$
- Action-Value Function:
$$q_\pi(s, a) = \mathbb{E}_\pi[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$$
```mermaid
%%{init: { "flowchart": { "htmlLabels": true, "curve": "linear" } } }%%
graph TB
subgraph 4.[4.]
direction TB
style 4. color:#000, fill:#fff, stroke:#ddd;
style a6 fill:#222,stroke:#333,stroke-width:2px;
style a7 fill:#222,stroke:#333,stroke-width:2px;
style a8 fill:#222,stroke:#333,stroke-width:2px;
style a9 fill:#222,stroke:#333,stroke-width:2px;
style a10 fill:#222,stroke:#333,stroke-width:2px;
style s10 fill:#fff,stroke:#333,stroke-width:2px;
style s11 fill:#fff,stroke:#333,stroke-width:2px;
a6(( )) --- s10(( ))
a6 --- s11(( ))
s11 --- a7(( ))
s11 --- a8(( ))
s10 --- a9(( ))
s10 --- a10(( ))
end
subgraph 3.[3.]
direction TB
style 3. color:#000, fill:#fff, stroke:#ddd;
style a4 fill:#222,stroke:#333,stroke-width:2px;
style a5 fill:#222,stroke:#333,stroke-width:2px;
style s5 fill:#fff,stroke:#333,stroke-width:2px;
style s6 fill:#fff,stroke:#333,stroke-width:2px;
style s7 fill:#fff,stroke:#333,stroke-width:2px;
style s8 fill:#fff,stroke:#333,stroke-width:2px;
style s9 fill:#fff,stroke:#333,stroke-width:2px;
s5(( )) --- a4(( ))
s5 --- a5(( ))
a5 --- s6(( ))
a5 --- s7(( ))
a4 --- s8(( ))
a4 --- s9(( ))
end
subgraph 2.[2.]
direction TB
style 2. color:#000, fill:#fff, stroke:#ddd;
style a3 fill:#222,stroke:#333,stroke-width:2px;
style s3 fill:#fff,stroke:#333,stroke-width:2px;
style s4 fill:#fff,stroke:#333,stroke-width:2px;
style h2 fill:#fff,stroke:#fff;
a3(( )) --- s3(( ))
a3 --- s4(( ))
s4 --- h2(( ))
linkStyle 14 stroke:#fff,stroke-width:2px;
end
subgraph 1.[1.]
direction TB
style 1. color:#000, fill:#fff, stroke:#ddd;
style a2 fill:#222,stroke:#333,stroke-width:2px;
style a1 fill:#222,stroke:#333,stroke-width:2px;
style s1 fill:#fff,stroke:#333,stroke-width:2px;
style h1 fill:#fff,stroke:#fff;
s1(( )) --- a1(( ))
s1 --- a2(( ))
a2 --- h1(( ))
linkStyle 17 stroke:#fff,stroke-width:2px;
end
```
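Applying the Bellman expectation backup repeatedly converges to $v_\pi$. Below is a sketch of iterative policy evaluation on a hypothetical 2-state, 2-action MDP; all transition probabilities, rewards, and the uniform random policy are assumed illustrative values.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (all numbers are illustrative).
# P[a, s, s2] = P(S_{t+1}=s2 | S_t=s, A_t=a); R[s, a] = expected immediate reward.
P = np.array([
    [[0.9, 0.1], [0.3, 0.7]],   # dynamics under action 0
    [[0.1, 0.9], [0.6, 0.4]],   # dynamics under action 1
])
R = np.array([[1.0, 0.0],
              [2.0, -1.0]])
gamma = 0.9
pi = np.array([[0.5, 0.5],      # pi(a | s): uniform random policy
               [0.5, 0.5]])

# Repeatedly apply the Bellman expectation backup.
v = np.zeros(2)
for _ in range(500):
    q = R + gamma * np.einsum("ast,t->sa", P, v)  # q_pi(s, a)
    v = (pi * q).sum(axis=1)                      # v_pi(s) = sum_a pi(a|s) q_pi(s, a)
print(v)
```

Each iteration is one application of the two equations above: first the action-value backup through the dynamics, then the average over the policy. Since the backup is a γ-contraction, the loop converges geometrically.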
Optimal Value Function
Definition: The optimal state-value function $v_*(s)$ of a Markov Decision Process (MDP) is the expected return starting from state $s$ and following the optimal policy:
$$v_*(s) = \max_\pi v_\pi(s)$$
Definition: The optimal action-value function $q_*(s, a)$ of a Markov Decision Process (MDP) is the expected return starting from state $s$, taking action $a$, and thereafter following the optimal policy:
$$q_*(s, a) = \max_\pi q_\pi(s, a)$$
The optimal value function estimates how good it is for the agent to be in a particular state under the best possible behaviour; once the optimal value function is known, the MDP is effectively solved.
Optimal Policy: Define a partial ordering over policies: $\pi \geq \pi'$ if $v_\pi(s) \geq v_{\pi'}(s)$ for all states $s$.
Theorem: For any Markov Decision Process (MDP),
- There exists an optimal policy $\pi_*$ that is better than or equal to all other policies: $\pi_* \geq \pi$ for all $\pi$.
- All optimal policies achieve the optimal value function: $v_{\pi_*}(s) = v_*(s)$.
- All optimal policies achieve the optimal action-value function: $q_{\pi_*}(s, a) = q_*(s, a)$.
To find the optimal policy, we first find the optimal value function and then derive the policy from it: an optimal policy acts greedily with respect to $q_*$, i.e. $\pi_*(a \mid s) = 1$ if $a = \arg\max_{a' \in A} q_*(s, a')$, and $0$ otherwise.
- There is always at least one deterministic optimal policy for any MDP.
- There may be more than one deterministic optimal policy for an MDP.
Bellman Optimality Equation
The Bellman optimality equation is a fundamental equation in dynamic programming. It decomposes the optimal value function into two parts: immediate reward and discounted value of the next state under the optimal policy.
- State-Value Function:
$$v_*(s) = \max_a q_*(s, a)$$
- Action-Value Function:
$$q_*(s, a) = \mathcal{R}(s, a) + \gamma \sum_{s' \in S} \mathcal{P}^a_{ss'}\, v_*(s')$$
- State to State Transition:
$$v_*(s) = \max_a \Big[ \mathcal{R}(s, a) + \gamma \sum_{s' \in S} \mathcal{P}^a_{ss'}\, v_*(s') \Big]$$
- Action to Action Transition:
$$q_*(s, a) = \mathcal{R}(s, a) + \gamma \sum_{s' \in S} \mathcal{P}^a_{ss'} \max_{a'} q_*(s', a')$$
- The Bellman optimality equation is non-linear (because of the max operator).
- There is no general closed-form solution for the optimal value function.
- Iterative methods are used to solve the Bellman optimality equation: value iteration, policy iteration, Q-learning, SARSA, DQN, etc.
#MMI706 - Reinforcement Learning at METU